With categorical predictors

Check what the mean values are per gender
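A quick way to do that (a sketch, assuming `weight_data` has `gender` and `weight_kg` columns, as in the models below):

```r
library(dplyr)

# Group means per gender; these should match the model coefficients below:
# the reference level's mean becomes the intercept, and the other level's
# mean is intercept + beta
weight_data %>% 
  group_by(gender) %>% 
  summarise(mean_weight = mean(weight_kg))
```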

With categorical predictors

One category is chosen as the base/reference level: the first alphabetically.

  • Other categories are coded as a distance of 1 from the reference level (dummy coding)
  • beta = slope = change in y / change in x
  • since the change in x is 1, beta is the difference in mean y between the categories
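To see this concretely, here is a minimal sketch with made-up data: the intercept recovers the reference group's mean, and the beta recovers the difference between group means.

```r
set.seed(1)
toy <- data.frame(group = rep(c("a", "b"), each = 20),
                  y = c(rnorm(20, mean = 10), rnorm(20, mean = 15)))

# "a" is the reference level (first alphabetically); "b" is coded as 1
coef(lm(y ~ group, toy))
tapply(toy$y, toy$group, mean)  # (Intercept) ~ mean of "a"; groupb ~ mean(b) - mean(a)
```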

With categorical predictors

mod_gender <- lm(weight_kg ~ gender, weight_data)
summary(mod_gender)
## 
## Call:
## lm(formula = weight_kg ~ gender, data = weight_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.4169  -3.6867  -0.4045   3.3929  15.0021 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  41.7536     0.4069  102.61   <2e-16 ***
## gendermale    7.1569     0.5909   12.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.368 on 329 degrees of freedom
## Multiple R-squared:  0.3084, Adjusted R-squared:  0.3063 
## F-statistic: 146.7 on 1 and 329 DF,  p-value: < 2.2e-16

With categorical predictors

What if we want the other to be the reference level?

gender_levels <- c("male", "female")

weight_data %>% 
  mutate(gender = factor(gender, levels = gender_levels)) %>% 
  lm(weight_kg ~ gender, .) %>% 
  summary()
## 
## Call:
## lm(formula = weight_kg ~ gender, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.4169  -3.6867  -0.4045   3.3929  15.0021 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   48.9105     0.4284  114.17   <2e-16 ***
## genderfemale  -7.1569     0.5909  -12.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.368 on 329 degrees of freedom
## Multiple R-squared:  0.3084, Adjusted R-squared:  0.3063 
## F-statistic: 146.7 on 1 and 329 DF,  p-value: < 2.2e-16
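A shorter alternative is base R's `relevel()`, which only changes the reference level (a sketch; `gender` is wrapped in `factor()` first in case it is stored as character):

```r
weight_data %>% 
  mutate(gender = relevel(factor(gender), ref = "male")) %>% 
  lm(weight_kg ~ gender, .) %>% 
  summary()
```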

3+ categories?

All betas are differences from the reference level

mod_iris <- lm(Sepal.Length ~ Species, iris) 
summary(mod_iris)
## 
## Call:
## lm(formula = Sepal.Length ~ Species, data = iris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6880 -0.3285 -0.0060  0.3120  1.3120 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.0060     0.0728  68.762  < 2e-16 ***
## Speciesversicolor   0.9300     0.1030   9.033 8.77e-16 ***
## Speciesvirginica    1.5820     0.1030  15.366  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5148 on 147 degrees of freedom
## Multiple R-squared:  0.6187, Adjusted R-squared:  0.6135 
## F-statistic: 119.3 on 2 and 147 DF,  p-value: < 2.2e-16

What can we do if we want to know more than whether “versicolor” and “virginica” differ from “setosa”?

3+ categories?

First, try the simple way: rearrange your data to highlight what you want to know
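For example (a sketch): releveling `Species` so "versicolor" is the reference makes the remaining betas compare the other two species against it directly.

```r
iris2 <- iris
iris2$Species <- relevel(iris2$Species, ref = "versicolor")

# Speciesvirginica now estimates the virginica - versicolor difference
summary(lm(Sepal.Length ~ Species, iris2))
```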

3+ categories?

Slightly more complicated (and in this case, unnecessary) way: use a package to compare each pair

library(emmeans)
iris_species <- emmeans(mod_iris, "Species")
pairs(iris_species)
##  contrast               estimate    SE  df t.ratio p.value
##  setosa - versicolor      -0.930 0.103 147  -9.033 <.0001 
##  setosa - virginica       -1.582 0.103 147 -15.366 <.0001 
##  versicolor - virginica   -0.652 0.103 147  -6.333 <.0001 
## 
## P value adjustment: tukey method for comparing a family of 3 estimates

Regression vs. anova/t-test/etc.

Multiple regression

\(y \sim b_0 + b_1x + \epsilon\) (intercept + slope + error)

Why not:

\(y \sim b_0 + b_1x_1 + b_2x_2 + ... + \epsilon\)

Multiple regression (special case)

Starting with faux data

What’s the correlation between x1 and x2 here?

library(faux)  # rnorm_multi() comes from the faux package

faux_data <- rnorm_multi(n = 100,
                         mu = c(3, 2, 1),
                         sd = c(3, 3, 3),
                         r = c(0.3, 0.5, 0),
                         varnames = c("outcome", "x1", "x2"),
                         empirical = TRUE)
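You can check directly: with `empirical = TRUE` the sample correlations match `r` exactly. In `rnorm_multi()`, `r` fills the upper triangle of the correlation matrix row by row, so the last entry (0) is the x1-x2 correlation.

```r
# (outcome, x1) = 0.3, (outcome, x2) = 0.5, (x1, x2) = 0
round(cor(faux_data), 2)
```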

Multiple regression (special case)

lm(outcome ~ x1 + x2, faux_data) %>% 
  summary()
## 
## Call:
## lm(formula = outcome ~ x1 + x2, data = faux_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4826 -1.5931 -0.0941  1.4871  6.3893 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.90000    0.30764   6.176 1.54e-08 ***
## x1           0.30000    0.08249   3.637 0.000444 ***
## x2           0.50000    0.08249   6.062 2.58e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.462 on 97 degrees of freedom
## Multiple R-squared:   0.34,  Adjusted R-squared:  0.3264 
## F-statistic: 24.98 on 2 and 97 DF,  p-value: 1.77e-09

Are the betas what you expect? (Because x1 and x2 are uncorrelated and all SDs are equal, each beta equals its simple correlation with the outcome.) The \(R^2\) = 0.34, which tells us how much of the variance in the outcome we can predict given x1 and x2 (more on this later)

Collinearity

What to do?

  • If very correlated, are they really telling you anything new?
  • Do they reflect the same underlying phenomenon?
    • Consider SEM
  • Is it theoretically reasonable to average them?
  • Is it theoretically interesting that one drops out?
  • Is it theoretically reasonable to leave one out?
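One way to quantify collinearity (a sketch, using `vif()` from the car package): variance inflation factors near 1 mean the predictors are nearly uncorrelated, while rough rules of thumb flag values above about 5-10 as problematic.

```r
library(car)

# x1 and x2 were generated uncorrelated above, so both VIFs should be ~1
vif(lm(outcome ~ x1 + x2, faux_data))
```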

Collinearity

Go back and model the weight data with 2 predictors: height and gender.

What happens?
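A sketch of that model (assuming the height column is named `height_cm`; the actual column name in `weight_data` may differ):

```r
# Height and gender are themselves correlated, so expect the gender beta to
# shrink (and its SE to grow) relative to the gender-only model
lm(weight_kg ~ height_cm + gender, weight_data) %>% 
  summary()
```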

Taking a step back: what do we want out of our models?

  • Reduce unexplained variance
  • More predictors will usually do that…
  • But what are the dangers?
    • overfitting
    • lack of generalizability
    • less parsimony
  • So what should I rely on?
    • THEORY